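# Put each 4-line FASTQ record on one tab-separated line, keep the records whose
# sequence ($2) and quality ($4) strings are at least 150 characters long,
# then restore the 4-line layout.
paste - - - - < bad_filt_trim_clip.fastq \
| awk -F '\t' 'length($2) >= 150 && length($4) >= 150' \
| tr '\t' '\n' > bad_filt_trim_clip_eq.fastq

# Re-run FastQC on the cleaned file and open the HTML report(s).
fastqc bad_filt_trim_clip_eq.fastq
firefox *.html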

If you use the above script for other FASTQ files, you may need to change the minimum read length (150 in our example) and the column numbers. In our example FASTQ file, “$2” is the sequence column and “$4” is the quality column. These column numbers may differ depending on the content of the FASTQ definition line, for example if the definition line itself contains tab characters.
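To make the read length easier to change, the filter can be wrapped in a small function. The following is only a sketch: the function name filter_fastq_by_length and its three arguments are hypothetical, not part of the original script, and it assumes 4-line FASTQ records with no tab characters inside any line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): keep FASTQ records whose
# sequence and quality strings are both at least min_len characters long.
# Assumes 4-line records and no tab characters inside any line.
filter_fastq_by_length() {
    local in_fastq=$1 out_fastq=$2 min_len=$3
    # One record per line -> filter on columns 2 and 4 -> back to four lines.
    paste - - - - < "$in_fastq" \
    | awk -F '\t' -v min="$min_len" 'length($2) >= min && length($4) >= min' \
    | tr '\t' '\n' > "$out_fastq"
}
```

Under these assumptions, calling filter_fastq_by_length bad_filt_trim_clip.fastq bad_filt_trim_clip_eq.fastq 150 should reproduce the filtering step of the script above.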

Figure 1.35 shows the per base sequence quality of the final FASTQ file. Remember that you may not be able to fix all quality problems, and that filtering and clipping may reduce the sequencing depth. Fortunately, most problems other than base quality errors can be tolerated by the majority of alignment programs. However, we should try to fix as many of the failed metrics as possible before continuing to the subsequent step of the analysis.
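Because filtering and clipping discard reads, it can help to check how many reads survive. The sketch below is one way to do that; the function names count_reads and report_read_loss are hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per record.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original text).
# Assumes uncompressed FASTQ with exactly four lines per record.
count_reads() {
    local n_lines
    n_lines=$(wc -l < "$1")
    # Four lines per record, so reads = lines / 4.
    echo $(( n_lines / 4 ))
}

report_read_loss() {
    local before after
    before=$(count_reads "$1")   # FASTQ file before cleaning
    after=$(count_reads "$2")    # FASTQ file after cleaning
    awk -v b="$before" -v a="$after" 'BEGIN {
        pct = (b > 0) ? 100 * a / b : 0
        printf "%d of %d reads kept (%.1f%%)\n", a, b, pct
    }'
}
```

For instance, report_read_loss followed by the names of the FASTQ file before and after cleaning prints the number and percentage of reads that were kept.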

FIGURE 1.35  Cleaned FASTQ file.